GH-49753: [C++][Gandiva] Fix overflow in string functions.#49813
GH-49753: [C++][Gandiva] Fix overflow in string functions.#49813abtom87 wants to merge 2 commits intoapache:mainfrom
Conversation
14851b5 to
f1be8c1
Compare
There was a problem hiding this comment.
Pull request overview
Fixes potential integer-overflow/invalid-length issues in Gandiva string functions by adding overflow-checked allocation sizing and expanding unit tests to cover extreme and negative lengths.
Changes:
- Add overflow-checked allocation calculations for
quote_utf8andto_hex_binary. - Introduce a shared allocation-length helper for
gdv_fn_lower_utf8,gdv_fn_upper_utf8, andgdv_fn_initcap_utf8, including negative-length rejection. - Add/extend GoogleTest coverage for overflow and negative-length scenarios.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| cpp/src/gandiva/precompiled/string_ops.cc | Adds overflow-checked allocation sizing for quote_utf8 and to_hex_binary. |
| cpp/src/gandiva/precompiled/string_ops_test.cc | Adds overflow-focused tests for quote_utf8 and to_hex_binary. |
| cpp/src/gandiva/gdv_string_function_stubs.cc | Adds shared allocation-length helper and overflow checks; adjusts some integer types in translate. |
| cpp/src/gandiva/gdv_function_stubs_test.cc | Adds tests for max-length overflow and negative-length handling in upper/lower/initcap and translate. |
Comments suppressed due to low confidence (1)
cpp/src/gandiva/gdv_string_function_stubs.cc:792
- In the multibyte path, the
from_for <= from_lenloop readsfrom[from_for](to computelen_char_from) before checkingfrom_for == from_len. Whenfrom_forreachesfrom_len(the sentinel iteration used to handle the "not found" case), this does a 1-past-the-end read. Reorder the loop so thefrom_for == from_lencase is handled before indexing intofrom(or change the loop bounds to avoid the sentinel iteration and handle the fallback after the loop).
for (int32_t from_for = 0; from_for <= from_len; from_for += len_char_from) {
// Updating len to char in this position
len_char_from = gdv_fn_utf8_char_length(from[from_for]);
// Making copy to std::string with length for this char position
std::string copy_from_compare(from + from_for, len_char_from);
if (from_for == from_len) {
// If it's not in the FROM list, just add it to the map and the result.
std::string insert_copy_value(in + in_for, len_char_in);
// Insert in map to next loops
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| int32_t bad_in_len = std::numeric_limits<int32_t>::max() / 2 + 20; | ||
| out_str = to_hex_binary(ctx_ptr, binary_string, bad_in_len, &out_len); | ||
| EXPECT_EQ(out_len, 0); | ||
| EXPECT_EQ(out_str, ""); |
There was a problem hiding this comment.
These assertions compare const char* pointers (EXPECT_EQ(out_str, "")) rather than string contents, which can be non-deterministic. Use EXPECT_STREQ(out_str, "") or compare std::string(out_str, out_len) instead, and consider asserting/resetting the context error since the overflow path sets an error message.
| EXPECT_EQ(out_str, ""); | |
| EXPECT_STREQ(out_str, ""); | |
| EXPECT_TRUE(ctx.has_error()); | |
| ctx.Reset(); |
| } | ||
| } else { // If there are no multibytes in the input, work with std::strings | ||
| } else { | ||
| // If there are no multibytes in the input, work with std::strings |
There was a problem hiding this comment.
The comment in the has_multi_byte branch is inverted: this else executes when multibyte UTF-8 is present, but the comment says "If there are no multibytes". Please correct the comment to match the control flow to avoid confusion during future maintenance.
| // If there are no multibytes in the input, work with std::strings | |
| // If there are multibytes in the input, work with std::strings |
| int32_t alloc_length = 0; | ||
| // Check overflow: 2 * in_len | ||
| if (ARROW_PREDICT_FALSE( | ||
| arrow::internal::MultiplyWithOverflow(2, in_len, &alloc_length))) { | ||
| gdv_fn_context_set_error_msg(context, "Would overflow maximum output size"); | ||
| *out_len = 0; | ||
| return ""; | ||
| } | ||
|
|
||
| // try to allocate double size output string (worst case) | ||
| auto out = | ||
| reinterpret_cast<char*>(gdv_fn_context_arena_malloc(context, (in_len * 2) + 2)); | ||
| reinterpret_cast<char*>(gdv_fn_context_arena_malloc(context, alloc_length + 2)); | ||
| if (out == nullptr) { |
There was a problem hiding this comment.
quote_utf8 checks overflow for 2 * in_len, but the allocation size is alloc_length + 2. For boundary values (e.g., in_len == INT32_MAX/2), 2 * in_len fits in int32_t while + 2 overflows, which can lead to a negative/incorrect allocation size being passed to gdv_fn_context_arena_malloc. Please include the + 2 in the overflow-safe size computation (or separately check that alloc_length <= INT32_MAX - 2).
There was a problem hiding this comment.
fixed where/how?
There was a problem hiding this comment.
These lines check for the suggested overflow. using the AddwithOverflow method.
| auto ret = | ||
| reinterpret_cast<char*>(gdv_fn_context_arena_malloc(context, text_len * 2 + 1)); | ||
| reinterpret_cast<char*>(gdv_fn_context_arena_malloc(context, alloc_length + 1)); | ||
|
|
There was a problem hiding this comment.
to_hex_binary only special-cases text_len == 0. If text_len is negative, the overflow check may not trigger and alloc_length can become negative, which then gets passed to gdv_fn_context_arena_malloc (takes an int32_t size) and can result in invalid/huge allocations or crashes. Treat negative lengths as invalid (or at least as empty) by changing the guard to text_len <= 0 (and ideally setting an error for negative input, consistent with other functions).
d0280e4 to
8bb980f
Compare
| out_str = to_hex_binary(ctx_ptr, binary_string, neg_in_len, &out_len); | ||
| EXPECT_EQ(out_len, 0); | ||
| EXPECT_STREQ(out_str, ""); | ||
| ctx.Reset(); |
There was a problem hiding this comment.
Could we add a couple more boundary tests here just to make the hardening feel more complete? I’m thinking of cases like quote_utf8 right around the INT32_MAX / 2 boundary, substring_index with negative lengths, and maybe one concat_ws_* case where the combined output size would overflow.
There was a problem hiding this comment.
Yes I will add more tests to cover these, later today.
There was a problem hiding this comment.
I'd love to see the commit history in pull request - easier to review diffs that way.
You found few issues in the implementation via these new tests, correct?
There was a problem hiding this comment.
I'd love to see the commit history in pull request - easier to review diffs that way.
I am not sure why github is not showing the commit diffs. But it should be easier if you click on the "Files changed" tab and see the diff there.
You found few issues in the implementation via these new tests, correct?
Not really. One new issue was found by cop-pilot. One issue i found via Woverflow warning while compiling, that was the one with memcpy. And I expanded the tests.
8bb980f to
0ccb5ee
Compare
| return ""; | ||
| } | ||
|
|
||
| int32_t temp = 0; |
There was a problem hiding this comment.
There are also 3, 4 and 5 argument versions of concat_ws below
There was a problem hiding this comment.
Yes, they have been handled, as well. Perhaps a better way might be to approach them using variadic template or so.
0ccb5ee to
f02d531
Compare
| EXPECT_EQ(std::string(out_str, out_len), "'\\'\\'\\'\\'\\'\\'\\'\\'\\''"); | ||
| EXPECT_FALSE(ctx.has_error()); | ||
|
|
||
| int32_t bad_in_len = std::numeric_limits<int32_t>::max() / 2 + 20; |
There was a problem hiding this comment.
Could you check int32_t bad_in_len = std::numeric_limits<int32_t>::max() / 2 + 1? Might need to have UndefinedBehaviorSanitizer to actually catch the issue 🤔
There was a problem hiding this comment.
I believe that worked, I can try adding that too. What did not work though is (max/2) -1. The logic should allocate that memory, but i got a segfault, probably because the size was too huge.
Edit: added.
f02d531 to
e7ff043
Compare
…ction_stubs Fixes potential integer-overflow/invalid-length issues in gdv_string_function_stubs.cc. Expanded unit-tests. Incorporated review comments.
e7ff043 to
be80053
Compare
Fixes potential integer-overflow/invalid-length issues in string_ops.cc. Expanded unit-tests for concat_ws*. Incorporated review comments.
be80053 to
5713446
Compare
| EXPECT_EQ(std::string(out_str, out_len), "'\\'\\'\\'\\'\\'\\'\\'\\'\\''"); | ||
| EXPECT_FALSE(ctx.has_error()); | ||
|
|
||
| int32_t bad_in_len = std::numeric_limits<int32_t>::max() / 2 + 20; |
There was a problem hiding this comment.
I believe that worked, I can try adding that too. What did not work though is (max/2) -1. The logic should allocate that memory, but i got a segfault, probably because the size was too huge.
Edit: added.
| return ""; | ||
| } | ||
|
|
||
| int32_t temp = 0; |
There was a problem hiding this comment.
Yes, they have been handled, as well. Perhaps a better way might be to approach them using variadic template or so.
Rationale for this change
Fix the overflow in functions where strings are used
What changes are included in this PR?
Fixes overflow and handles negative lengths
Are these changes tested?
Yes, modified relevant google tests and ran them locally.
Are there any user-facing changes?
No
This PR contains a "Critical Fix". (If the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld), please provide explanation. If not, you can remove this.)